What I Learned from Writing a Custom Parser

Shlomi Fish on 2007-06-08T19:11:31

I blogged about XML-Grammar-Screenplay here before. Right now, seeing that Parse::RecDescent is on the slow side, and that it's error reports are non-existent, I've decided that I want to write my own custom ("Quick-and-Dirty") parser. So I svk cp'ed my main test file, that tests the conversion of the proto-format to XML and validates the XML, and started testing it with the custom parser.

Now, since I could not write the custom parser at once, I decided to incrementally add more tests as I implement them. For this, I used the following:

# TEST:$num_texts=6
@tests = splice(@tests, 0, 6);

This is the first time I recall using splice() in production. I needed it so the tests' counter (that makes use of Test-Count) will be easily capable of being synchronised with the actual number. This is while not using something ugly like @tests = @tests[0 .. (6-1)]. splice()'s substr()-like semantics made it ideal in this case.

Anyway, I gradually added more tests and implemented them in the custom parser. When planning it I wanted to write a parser that was event based, with a dedicated user-land stack, and without functions calling each other arbitrarily. However, I found it comfortable to implement it using functions calling each other for each logical unit of the text, and it still stayed this way. Thus, you can say it's a classic ad-hoc Recursive-Descent parser, although a non-backtracking one.

One of the objectives for the parser was to make sure it does not backtrack. That's because it's completely unnecessary with my defined syntax, and because it is a possible reason for Parse::RecDescent's slowness (especially on a slightly invalid syntax, which causes it to frantically try different options).

The parser is now parsing a subset of the functionality and is getting there. It indeed feels very "quick-and-dirty" and raw. Perhaps a better design would be to use some kind of a state machine or a decision tree. I also found a use for redo in the first time in a long while, as I had a while ... continue { $self->next_line() } construct, and was trying to avoid executing the continue {...} block.

One catch is that I split the text into lines, and handle one line at the time. Due to the sytnax, it sometimes complicates matters and sometimes facilitates them.

It may actually be harder than having written the P::RD-based parser, and is time consuming. I often find that I have to take rests in between and do something which is more straightforward to do. But it is a fun, in a rather braindead way. At least when it's done, I (and others) would be able to work on my screenplays with less frustrations and with more style, which is why I'm doing it.